Introduction

In this project, we will analyze a dataset describing various aspects of residential homes in Ames, Iowa.

Our initial step is a comprehensive exploratory data analysis to identify potential missing values and outliers, and to decide how to address them.

Secondly, we will conduct a Principal Component Analysis (PCA). This technique condenses the information in the original variables into a few linear combinations, achieving dimensionality reduction while retaining as much variance as possible. These linear combinations are orthogonal to each other and aligned with the directions of maximum variance, which guarantees that they are uncorrelated.

Next, we will perform Factor Analysis (FA), identifying latent variables that exhibit a high correlation with specific groups of observable variables and minimal correlation with others. FA facilitates dimensionality reduction by capturing the underlying structure in the data.

In the final stage, we will apply both Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). Before these analyses, we will verify the required normality assumptions. Discriminant analysis is a classification method for a qualitative response variable: it assigns new observations, based on their characteristics (explanatory or predictor variables), to one of the categories of that response variable.

Data pre-processing

Imports

# Package required to call 'freq' and 'descr' functions (descriptive statistics)
library(summarytools)

# Package required to call 'ggplot' function (graphical tools)
library(ggplot2)

# Package required to call 'read.spss' function (loading '.spss' data format)
library(foreign)

# Package required to call 'read_xlsx' function (loading '.xlsx' data format)
library(readxl)

# Package required to load the data set 'RBGlass1'
library(archdata)

# Package required to call 'ggarrange' function (arranging multiple plots)
library(ggpubr)

# Package required to call 'fviz_pca_var, fviz_pca_ind and fviz_pca' functions
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Package required to call 'scatterplot3d' function
library(scatterplot3d)

# Package required to call 'cortest.bartlett', 'fa' and 'scree' functions
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Database Description

The file train.csv contains the following variables:

  • GrLivArea: Above grade (ground) living area square feet

  • GarageArea: Size of the garage in square feet

  • X1stFlrSF: First floor square feet

  • LotArea: Lot size in square feet

  • OverallQual: Rates the overall material and finish of the house

  • YearBuilt: Year of construction

  • SalePrice: Sale price

Load Data from a local file

The data file has a .csv extension, so the read.csv function is used to load the file.

data <- read.csv("train.csv")

Select the columns mentioned before:

data <- data[c("GrLivArea","GarageArea", "X1stFlrSF", "LotArea", "OverallQual", "YearBuilt", "SalePrice")]
colnames(data)
## [1] "GrLivArea"   "GarageArea"  "X1stFlrSF"   "LotArea"     "OverallQual"
## [6] "YearBuilt"   "SalePrice"

Identification and treatment of missing values (NA)

Initially, we examine whether the dataset contains missing values. To do this, we define a function that computes the percentage of missing values in a vector and then apply it to every column of the data frame with the apply function.

percentageNA <- function(x){
  # Percentage of NA entries in a vector
  100 * sum(is.na(x)) / length(x)
}
contNA<-apply(data,2,percentageNA)
contNA
##   GrLivArea  GarageArea   X1stFlrSF     LotArea OverallQual   YearBuilt 
##           0           0           0           0           0           0 
##   SalePrice 
##           0

There are no missing values, so we can move directly to the next step.
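The same column-wise percentages can also be obtained in a single line with colMeans; a minimal sketch on a toy data frame (the df object and its values are illustrative):

```r
# Toy data frame with one missing value to illustrate the idea
df <- data.frame(a = c(1, NA, 3, 4), b = c(5, 6, 7, 8))

# is.na(df) is a logical matrix; colMeans gives the fraction of TRUEs per column
pct_na <- colMeans(is.na(df)) * 100
pct_na
#  a  b
# 25  0
```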

Classic Numerical Descriptive Analysis

We will conduct a preliminary exploratory data analysis of the dataset. As the variables are quantitative, we will present basic numerical descriptive statistics along with visual representations such as histograms, density plots, and boxplots.

for (col in c("GrLivArea","GarageArea", "X1stFlrSF", "LotArea", "OverallQual", "YearBuilt", "SalePrice")) {
  print(descr(data[col]))
}
## Descriptive Statistics  
## data$GrLivArea  
## N: 1460  
## 
##                     GrLivArea
## ----------------- -----------
##              Mean     1515.46
##           Std.Dev      525.48
##               Min      334.00
##                Q1     1129.00
##            Median     1464.00
##                Q3     1777.50
##               Max     5642.00
##               MAD      483.33
##               IQR      647.25
##                CV        0.35
##          Skewness        1.36
##       SE.Skewness        0.06
##          Kurtosis        4.86
##           N.Valid     1460.00
##         Pct.Valid      100.00
## Descriptive Statistics  
## data$GarageArea  
## N: 1460  
## 
##                     GarageArea
## ----------------- ------------
##              Mean       472.98
##           Std.Dev       213.80
##               Min         0.00
##                Q1       333.00
##            Median       480.00
##                Q3       576.00
##               Max      1418.00
##               MAD       177.91
##               IQR       241.50
##                CV         0.45
##          Skewness         0.18
##       SE.Skewness         0.06
##          Kurtosis         0.90
##           N.Valid      1460.00
##         Pct.Valid       100.00
## Descriptive Statistics  
## data$X1stFlrSF  
## N: 1460  
## 
##                     X1stFlrSF
## ----------------- -----------
##              Mean     1162.63
##           Std.Dev      386.59
##               Min      334.00
##                Q1      882.00
##            Median     1087.00
##                Q3     1391.50
##               Max     4692.00
##               MAD      347.67
##               IQR      509.25
##                CV        0.33
##          Skewness        1.37
##       SE.Skewness        0.06
##          Kurtosis        5.71
##           N.Valid     1460.00
##         Pct.Valid      100.00
## Descriptive Statistics  
## data$LotArea  
## N: 1460  
## 
##                       LotArea
## ----------------- -----------
##              Mean    10516.83
##           Std.Dev     9981.26
##               Min     1300.00
##                Q1     7549.00
##            Median     9478.50
##                Q3    11603.00
##               Max   215245.00
##               MAD     2962.23
##               IQR     4048.00
##                CV        0.95
##          Skewness       12.18
##       SE.Skewness        0.06
##          Kurtosis      202.26
##           N.Valid     1460.00
##         Pct.Valid      100.00
## Descriptive Statistics  
## data$OverallQual  
## N: 1460  
## 
##                     OverallQual
## ----------------- -------------
##              Mean          6.10
##           Std.Dev          1.38
##               Min          1.00
##                Q1          5.00
##            Median          6.00
##                Q3          7.00
##               Max         10.00
##               MAD          1.48
##               IQR          2.00
##                CV          0.23
##          Skewness          0.22
##       SE.Skewness          0.06
##          Kurtosis          0.09
##           N.Valid       1460.00
##         Pct.Valid        100.00
## Descriptive Statistics  
## data$YearBuilt  
## N: 1460  
## 
##                     YearBuilt
## ----------------- -----------
##              Mean     1971.27
##           Std.Dev       30.20
##               Min     1872.00
##                Q1     1954.00
##            Median     1973.00
##                Q3     2000.00
##               Max     2010.00
##               MAD       37.06
##               IQR       46.00
##                CV        0.02
##          Skewness       -0.61
##       SE.Skewness        0.06
##          Kurtosis       -0.45
##           N.Valid     1460.00
##         Pct.Valid      100.00
## Descriptive Statistics  
## data$SalePrice  
## N: 1460  
## 
##                     SalePrice
## ----------------- -----------
##              Mean   180921.20
##           Std.Dev    79442.50
##               Min    34900.00
##                Q1   129950.00
##            Median   163000.00
##                Q3   214000.00
##               Max   755000.00
##               MAD    56338.80
##               IQR    84025.00
##                CV        0.44
##          Skewness        1.88
##       SE.Skewness        0.06
##          Kurtosis        6.50
##           N.Valid     1460.00
##         Pct.Valid      100.00
attach(data)

p1<-ggplot(data,aes(x=GrLivArea))+geom_density()+
  labs(title = paste("Density function of", "GrLivArea"),x="GrLivArea",y="Values")

p2<-ggplot(data,aes(x=GrLivArea))+geom_histogram()+
  labs(title = paste("Histogram of", "GrLivArea"), x="GrLivArea",y="Values")

p3<-ggplot(data,aes(x=GrLivArea))+
  geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
  coord_flip()+labs(title = paste("Boxplot of", "GrLivArea"),x="Values",y="")

# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1<-ggplot(data,aes(x=GarageArea))+geom_density()+
  labs(title = paste("Density function of", "GarageArea"),x="GarageArea",y="Values")

p2<-ggplot(data,aes(x=GarageArea))+geom_histogram()+
  labs(title = paste("Histogram of", "GarageArea"), x="GarageArea",y="Values")

p3<-ggplot(data,aes(x=GarageArea))+
  geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
  coord_flip()+labs(title = paste("Boxplot of", "GarageArea"),x="Values",y="")

# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1<-ggplot(data,aes(x=X1stFlrSF))+geom_density()+
  labs(title = paste("Density function of", "X1stFlrSF"),x="X1stFlrSF",y="Values")

p2<-ggplot(data,aes(x=X1stFlrSF))+geom_histogram()+
  labs(title = paste("Histogram of", "X1stFlrSF"), x="X1stFlrSF",y="Values")

p3<-ggplot(data,aes(x=X1stFlrSF))+
  geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
  coord_flip()+labs(title = paste("Boxplot of", "X1stFlrSF"),x="Values",y="")

# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1<-ggplot(data,aes(x=LotArea))+geom_density()+
  labs(title = paste("Density function of", "LotArea"),x="LotArea",y="Values")

p2<-ggplot(data,aes(x=LotArea))+geom_histogram()+
  labs(title = paste("Histogram of", "LotArea"), x="LotArea",y="Values")

p3<-ggplot(data,aes(x=LotArea))+
  geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
  coord_flip()+labs(title = paste("Boxplot of", "LotArea"),x="Values",y="")

# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1<-ggplot(data,aes(x=OverallQual))+geom_density()+
  labs(title = paste("Density function of", "OverallQual"),x="OverallQual",y="Values")

p2<-ggplot(data,aes(x=OverallQual))+geom_histogram()+
  labs(title = paste("Histogram of", "OverallQual"), x="OverallQual",y="Values")

p3<-ggplot(data,aes(x=OverallQual))+
  geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
  coord_flip()+labs(title = paste("Boxplot of", "OverallQual"),x="Values",y="")

# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1<-ggplot(data,aes(x=YearBuilt))+geom_density()+
  labs(title = paste("Density function of", "YearBuilt"),x="YearBuilt",y="Values")

p2<-ggplot(data,aes(x=YearBuilt))+geom_histogram()+
  labs(title = paste("Histogram of", "YearBuilt"), x="YearBuilt",y="Values")

p3<-ggplot(data,aes(x=YearBuilt))+
  geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
  coord_flip()+labs(title = paste("Boxplot of", "YearBuilt"),x="Values",y="")

# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1<-ggplot(data,aes(x=SalePrice))+geom_density()+
  labs(title = paste("Density function of", "SalePrice"),x="SalePrice",y="Values")

p2<-ggplot(data,aes(x=SalePrice))+geom_histogram()+
  labs(title = paste("Histogram of", "SalePrice"), x="SalePrice",y="Values")

p3<-ggplot(data,aes(x=SalePrice))+
  geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
  coord_flip()+labs(title = paste("Boxplot of", "SalePrice"),x="Values",y="")

# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
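The seven nearly identical plotting blocks above differ only in the variable name, so they can be generated in a single loop. A sketch, using a small illustrative data frame df as a stand-in (in the report, loop over the columns of the data object instead):

```r
library(ggplot2)
library(ggpubr)

# Illustrative stand-in; replace df with the report's 'data' object
df <- data.frame(GrLivArea = c(1000, 1500, 1200, 2000),
                 SalePrice = c(1e5, 2e5, 1.5e5, 3e5))

# Build the three panels for every column and arrange them as before
for (v in names(df)) {
  p1 <- ggplot(df, aes(x = .data[[v]])) + geom_density() +
    labs(title = paste("Density function of", v), x = v, y = "Values")
  p2 <- ggplot(df, aes(x = .data[[v]])) + geom_histogram() +
    labs(title = paste("Histogram of", v), x = v, y = "Values")
  p3 <- ggplot(df, aes(x = .data[[v]])) +
    geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
    coord_flip() + labs(title = paste("Boxplot of", v), x = "Values", y = "")
  print(ggarrange(p1, p2, p3, nrow = 1, common.legend = FALSE))
}
```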

Outliers Treatment

To identify outliers, a prior graphical exploratory analysis must be conducted by creating boxplots for all variables.

Since all variables are quantitative, we have decided to replace outliers with the mean. To achieve this, the outlier function below detects outliers (using the 1.5 × IQR rule) and replaces them. Note that, in practice, replacing outliers with the mean should be preceded by an analysis of the cause of these atypical values.

boxplot(data,main="Exploratory data analysis",
        xlab="House features",
        ylab="Value",
        col=c(1:ncol(data)))

outlier <- function(data, na.rm = TRUE){
  # 1.5 * IQR rule: flag values outside [Q1 - H, Q3 + H] as NA
  H <- 1.5 * IQR(data)
  data[data < quantile(data, 0.25, na.rm = TRUE) - H] <- NA
  data[data > quantile(data, 0.75, na.rm = TRUE) + H] <- NA
  # Replace the flagged values with the mean of the remaining data
  data[is.na(data)] <- mean(data, na.rm = TRUE)

  # Imputation changes the quartiles, so recompute the fences and
  # recurse until no new outliers appear
  H <- 1.5 * IQR(data)
  if (TRUE %in% (data < quantile(data, 0.25, na.rm = TRUE) - H) |
      TRUE %in% (data > quantile(data, 0.75, na.rm = TRUE) + H))
    return(outlier(data))
  else
    return(data)
}
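As a quick sanity check, the function (repeated here so the snippet is self-contained) collapses an obvious outlier to the mean of the remaining values; the vector x is illustrative:

```r
outlier <- function(data, na.rm = TRUE){
  # Flag values outside the 1.5 * IQR fences as NA, impute with the mean,
  # and recurse until no new outliers appear (same function as above)
  H <- 1.5 * IQR(data)
  data[data < quantile(data, 0.25, na.rm = TRUE) - H] <- NA
  data[data > quantile(data, 0.75, na.rm = TRUE) + H] <- NA
  data[is.na(data)] <- mean(data, na.rm = TRUE)
  H <- 1.5 * IQR(data)
  if (TRUE %in% (data < quantile(data, 0.25, na.rm = TRUE) - H) |
      TRUE %in% (data > quantile(data, 0.75, na.rm = TRUE) + H))
    return(outlier(data))
  else
    return(data)
}

x <- c(1:10, 100)   # 100 lies far outside the 1.5 * IQR fences
outlier(x)
# [1]  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0  5.5
```

The outlier 100 is replaced by the mean of the remaining values, 5.5, and the re-check finds no new outliers.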

data<-as.data.frame(apply(data,2,outlier))


# Cleaned data
boxplot(data,main="Data without outliers",
        xlab="House features",
        ylab="Values",
        col=c(1:ncol(data)))

The outlier analysis is repeated below, this time with standardized data to improve the visualization across variables with very different scales.

normalized_data<-scale(data)
# The plot shows that many variables present outliers.
boxplot(normalized_data,main="Exploratory data analysis",
        xlab="House features",
        ylab="Value",
        col=c(1:ncol(data)))

Principal component analysis

Correlation

Firstly, it is necessary to verify that the variables are not independent. At the sample level, this can be done by computing and inspecting the correlation matrix. At the population level, correlation can be justified with Bartlett's sphericity test, which checks whether the correlations are significantly different from 0; the null hypothesis is \(det(R) = 1\) (i.e. \(R\) is the identity matrix), where \(R\) denotes the correlation matrix. The following code performs both checks.

cor(data)
##               GrLivArea GarageArea  X1stFlrSF      LotArea OverallQual
## GrLivArea   1.000000000 0.41091556 0.41252812  0.003349105  0.53890566
## GarageArea  0.410915560 1.00000000 0.46060093  0.036038044  0.54785260
## X1stFlrSF   0.412528115 0.46060093 1.00000000  0.048127106  0.42530933
## LotArea     0.003349105 0.03603804 0.04812711  1.000000000  0.06670334
## OverallQual 0.538905657 0.54785260 0.42530933  0.066703337  1.00000000
## YearBuilt   0.251413273 0.48573031 0.28939644 -0.017380668  0.59009075
## SalePrice   0.544805293 0.48223074 0.39358026 -0.008347953  0.58579926
##               YearBuilt    SalePrice
## GrLivArea    0.25141327  0.544805293
## GarageArea   0.48573031  0.482230739
## X1stFlrSF    0.28939644  0.393580261
## LotArea     -0.01738067 -0.008347953
## OverallQual  0.59009075  0.585799260
## YearBuilt    1.00000000  0.541428580
## SalePrice    0.54142858  1.000000000
normalized_data<-scale(data)
# cor() is scale-invariant, so scaling is optional here; passing n = nrow(data)
# to cortest.bartlett would avoid the 'n not specified' warning below
cortest.bartlett(cor(normalized_data))
## Warning in cortest.bartlett(cor(normalized_data)): n not specified, 100 used
## $chisq
## [1] 218.4913
## 
## $p.value
## [1] 8.031927e-35
## 
## $df
## [1] 21

In light of the test results \((p < 0.001)\), it can be stated that the variables are not independent. (The warning shows that a default sample size of \(n = 100\) was assumed; with the actual \(n = 1460\) the statistic would be even larger.) Therefore, it makes sense to consider dimensionality reduction through Principal Component Analysis (PCA) or Factor Analysis (FA).
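For reference, the statistic that cortest.bartlett reports can be computed by hand: \(\chi^2 = -\left(n - 1 - \frac{2p+5}{6}\right)\ln\det(R)\), with \(p(p-1)/2\) degrees of freedom. A minimal base-R sketch on synthetic correlated data (all names and values are illustrative):

```r
set.seed(1)
n <- 200; p <- 3
# Correlated synthetic data: columns 2 and 3 depend on x1
x1 <- rnorm(n)
X  <- cbind(x1, x1 + rnorm(n), x1 + rnorm(n))
R  <- cor(X)

# Bartlett's sphericity statistic
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df    <- p * (p - 1) / 2
p_val <- pchisq(chisq, df, lower.tail = FALSE)
c(chisq = chisq, df = df, p.value = p_val)
```

With strongly correlated columns, \(\det(R)\) is close to 0 and the statistic is large, so the null hypothesis of sphericity is rejected.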

Other ways to observe that the variables are correlated, which will be useful for Factor Analysis, include:

  • The heterogeneous correlation matrix computed by hetcor (which reduces to Pearson correlations when all variables are numeric, as here):
#install.packages("polycor")
suppressMessages(library(polycor))
## Warning: package 'polycor' was built under R version 4.3.2
#install.packages("ggcorrplot")
suppressMessages(library(ggcorrplot))
## Warning: package 'ggcorrplot' was built under R version 4.3.2
poly_cor<-hetcor(data)$correlations
ggcorrplot(poly_cor, type="lower",hc.order=T)

  • With corrplot:
#install.packages("corrplot")
suppressMessages(library(corrplot))

corrplot(cor(data), order = "hclust", tl.col='black', tl.cex=1)

  • With corrr:
#install.packages("corrr")
suppressMessages(library(corrr))
## Warning: package 'corrr' was built under R version 4.3.2
data_correlations <- correlate(data)  # Compute a correlation object
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
rplot(data_correlations, legend = TRUE, colours = c("firebrick1", "black","darkcyan"), print_cor = TRUE) 

Based on the results obtained in the previous sections, it is now possible to proceed with dimensionality reduction through either Principal Component Analysis or Factor Analysis. The data exhibit correlations, there are no missing values, and the presence of outliers has been addressed.

Performing the PCA and summarizing the information

The following code performs Principal Component Analysis (PCA), obtaining the eigenvectors that generate each component, as well as their eigenvalues, which correspond to the variance of each component.

PCA<-prcomp(data, scale=T, center = T)
PCA$rotation
##                    PC1          PC2         PC3         PC4         PC5
## GrLivArea   -0.3845483 -0.007471144  0.53047508  0.52719642  0.17343324
## GarageArea  -0.4155496  0.027712787 -0.06793035 -0.40089276  0.77286107
## X1stFlrSF   -0.3547808  0.111094486  0.49421252 -0.63164333 -0.46533385
## LotArea     -0.0228391  0.981395441 -0.13124212  0.10059078 -0.02738571
## OverallQual -0.4582233  0.033760363 -0.14708590  0.15744458  0.01843438
## YearBuilt   -0.3879810 -0.121223988 -0.65348617 -0.09627253 -0.26231022
## SalePrice   -0.4388077 -0.088668881 -0.06190899  0.34380218 -0.29355517
##                     PC6         PC7
## GrLivArea    0.14794959 -0.49067171
## GarageArea  -0.25213995  0.01822619
## X1stFlrSF    0.04024664 -0.02034391
## LotArea     -0.06741260 -0.06085992
## OverallQual  0.64591076  0.57002461
## YearBuilt    0.11046995 -0.56342936
## SalePrice   -0.69207499  0.33527687
PCA$sdev
## [1] 1.8282024 1.0058011 0.8989781 0.8150016 0.6918485 0.6343287 0.5409468
summary(PCA)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.8282 1.0058 0.8990 0.81500 0.69185 0.63433 0.5409
## Proportion of Variance 0.4775 0.1445 0.1154 0.09489 0.06838 0.05748 0.0418
## Cumulative Proportion  0.4775 0.6220 0.7375 0.83234 0.90071 0.95820 1.0000

The following graphs illustrate the behavior of the variance explained by each principal component, as well as the cumulative explained variance.

suppressMessages(library(ggplot2))


# Explained variance
explained_var <- PCA$sdev^2 / sum(PCA$sdev^2)
ggplot(data = data.frame(explained_var, pc = 1:ncol(data)),
       aes(x = pc, y =  explained_var, fill=explained_var )) +
  geom_col(width = 0.3) +
  scale_y_continuous(limits = c(0,0.6)) + theme_bw() +
  labs(x = "Principal component", y= " Proportion of the explained variance")

accum_var<-cumsum(explained_var)
ggplot( data = data.frame(accum_var, pc = 1:ncol(data)),
        aes(x = pc, y = accum_var ,fill=accum_var )) +
  geom_col(width = 0.5) +
  scale_y_continuous(limits = c(0,1)) +
  theme_bw() +
  labs(x = "Principal component",
       y = "Proportion of the accumulated explained variance")

Optimal number of components

To choose the appropriate number of principal components, the following method is employed: The variances explained by the principal components are averaged, and those components whose proportion of explained variance surpasses the mean are selected.

PCA$sdev^2
## [1] 3.3423241 1.0116359 0.8081616 0.6642276 0.4786544 0.4023729 0.2926235
mean(PCA$sdev^2)
## [1] 1

By this criterion (eigenvalues above the mean, equivalent to Kaiser's rule for standardized data), 2 components are chosen. Each principal component is obtained as the linear combination of all variables with the coefficients shown in the columns of the rotation matrix.
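Both facts can be checked directly: the eigenvalues of the correlation matrix average exactly 1, and the scores are recovered by multiplying the standardized data by the rotation matrix. A sketch on synthetic data (X is illustrative):

```r
set.seed(42)
X <- matrix(rnorm(100 * 4), ncol = 4)   # 100 observations, 4 variables
pca <- prcomp(X, scale. = TRUE, center = TRUE)

eig <- pca$sdev^2
mean(eig)                # equals 1 for standardized data (trace of R is p)
which(eig > mean(eig))   # components kept by the mean-eigenvalue rule

# Scores = standardized data %*% rotation matrix
scores <- scale(X) %*% pca$rotation
all.equal(scores, pca$x, check.attributes = FALSE)  # TRUE
```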

Some plots of interest

The following graphs compare pairs of principal components through a two-dimensional projection. In this representation, it is evident which of the original variables carry more or less weight in each of the compared components.

suppressMessages(library("factoextra"))

# Comparison between the 1st and 2nd principal components.
fviz_pca_var(PCA,
             repel=TRUE,col.var="cos2",
             legend.title="Distance")+theme_bw()

fviz_pca_var(PCA,axes=c(1,3),
             repel=TRUE,col.var="cos2",
             legend.title="Distance")+theme_bw()

fviz_pca_var(PCA,axes=c(2,3),
             repel=TRUE,col.var="cos2",
             legend.title="Distance")+theme_bw()

It is also possible to represent the observations along the principal components with the fviz_pca_ind function, using col.ind = "contrib" so that the observations contributing most to the components are highlighted with colors.

fviz_pca_ind(PCA,col.ind = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel=TRUE,legend.title="Contrib.var")+theme_bw()

Factor analysis

Different estimation methods

To extract the factors, an estimation method must be chosen. The fa() function is used for this purpose, as it implements several different estimation methods.

Next, the outputs of the maximum likelihood method and the minimum residual method are compared. Three factors are calculated, as suggested by the correlation graphs, although this is not entirely clear in this example.

# Maximum likelihood method
model1<-fa(poly_cor,
            nfactors = 3, 
            rotate = "none",
            fm="mle")

# Minimum residual method
model2<-fa(poly_cor,
            nfactors = 3,
            rotate = "none",
            fm="minres") 

The communalities of both models are compared to see what proportion of the variance is explained by the latent factors.

sort(model1$communality,decreasing = T)->c1
sort(model2$communality,decreasing = T)->c2
head(cbind(c1,c2))
##                    c1        c2
## YearBuilt   0.8733819 0.8779251
## GrLivArea   0.7817811 0.7882180
## OverallQual 0.6222477 0.6205131
## GarageArea  0.5676882 0.5587136
## SalePrice   0.5566391 0.5586521
## X1stFlrSF   0.3969742 0.4021478

The uniqueness, which represents the proportion of variance not explained by the factors (1 − communality), is also compared.

sort(model1$uniquenesses,decreasing = T)->u1
sort(model2$uniquenesses,decreasing = T)->u2
head(cbind(u1,u2))
##                    u1        u2
## LotArea     0.9833749 0.9853618
## X1stFlrSF   0.6030258 0.5978522
## SalePrice   0.4433609 0.4413479
## GarageArea  0.4323118 0.4412864
## OverallQual 0.3777523 0.3794869
## GrLivArea   0.2182189 0.2117820

Both models are quite similar, since the values obtained are very close.

Searching for the optimal number of factors

Different criteria exist for determining the optimal number of factors. The elbow method is implemented below.

The elbow method involves creating a scree plot, obtained by plotting eigenvalues in decreasing order on the y-axis and the corresponding component numbers for principal components or latent factors (depending on whether PCA or FA is being conducted) on the x-axis. Connecting the points creates a figure that starts with a steep slope and then transitions to a gentler incline. The method suggests selecting the number of factors up to the “elbow” of the plot, where the steep slope gives way to the gentler incline.

It is called a scree plot because it resembles the profile of a mountain with a steep slope leading to the base, where pebbles fallen from the summit accumulate (where they “sediment”).
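The ingredients of the scree plot are just the eigenvalues of the correlation matrix in decreasing order, which can be computed directly in base R; a sketch on synthetic data (X is illustrative):

```r
set.seed(7)
x1 <- rnorm(150)
X  <- cbind(x1, x1 + rnorm(150), rnorm(150), rnorm(150))

# Eigenvalues of the correlation matrix, returned in decreasing order
eig <- eigen(cor(X))$values
plot(eig, type = "b", xlab = "Component / factor number",
     ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)   # Kaiser reference line at eigenvalue 1
```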

scree(poly_cor)

fa.parallel(poly_cor,n.obs=200,fa="fa",fm="minres")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Parallel analysis suggests that the number of factors =  4  and the number of components =  NA

Parallel analysis suggests that the optimal number of factors is 4. Another way to assess this is with the following test:

library(stats)
factanal(data,factors=3,rotation="none") # factors = 4 throws an error: too many factors for 7 variables
## 
## Call:
## factanal(x = data, factors = 3, rotation = "none")
## 
## Uniquenesses:
##   GrLivArea  GarageArea   X1stFlrSF     LotArea OverallQual   YearBuilt 
##       0.218       0.432       0.603       0.983       0.378       0.127 
##   SalePrice 
##       0.443 
## 
## Loadings:
##             Factor1 Factor2 Factor3
## GrLivArea    0.594   0.637  -0.151 
## GarageArea   0.659   0.112   0.347 
## X1stFlrSF    0.491   0.258   0.298 
## LotArea                      0.124 
## OverallQual  0.774   0.139         
## YearBuilt    0.840  -0.404         
## SalePrice    0.725   0.172         
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.859   0.699   0.257
## Proportion Var   0.408   0.100   0.037
## Cumulative Var   0.408   0.508   0.545
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 11.85 on 3 degrees of freedom.
## The p-value is 0.0079

Since a 4-factor model cannot be fitted to only 7 variables (factanal throws an error), we take 3 as the number of factors, bearing in mind that the chi-square test \((p = 0.0079)\) indicates that 3 factors may still not be fully sufficient.

Estimation of the model

Finally, the factor model with 3 factors is estimated. A varimax rotation is applied to seek a simpler interpretation.

varimax_model<-fa(poly_cor,nfactors = 3,rotate = "varimax",
                   fm="minres")  # 'fm' selects the estimation method

The rotated factor loading matrix is:

print(varimax_model$loadings,cut=0) 
## 
## Loadings:
##             MR1    MR2    MR3   
## GrLivArea    0.150  0.873  0.065
## GarageArea   0.493  0.355  0.435
## X1stFlrSF    0.273  0.393  0.416
## LotArea     -0.005  0.003  0.121
## OverallQual  0.573  0.501  0.204
## YearBuilt    0.927  0.132 -0.042
## SalePrice    0.516  0.533  0.093
## 
##                  MR1   MR2   MR3
## SS loadings    1.793 1.594 0.434
## Proportion Var 0.256 0.228 0.062
## Cumulative Var 0.256 0.484 0.546

One could inspect the loading matrix to see which variables each factor correlates with; however, this is quite tedious, so the following representation is used:

fa.diagram(varimax_model)

We can see that the first latent factor is associated with YearBuilt, OverallQual and GarageArea; the second with GrLivArea and SalePrice; and the third with X1stFlrSF.

Linear discriminant analysis

The database contains information on various indicators measured across different residential homes, including the house price. Therefore, it would be interesting to provide a classification method for the price level based on other indicators.

To achieve this, a qualitative response variable for classification, called price, is defined. This variable has two categories:

price <- c()
mean_price <- mean(data$SalePrice)

for (k in 1:nrow(data)) {
  if (data$SalePrice[k] < mean_price) {
    price[k] <- "low"
  } else {
    price[k] <- "high"
  }
}
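The loop above can be replaced by a single vectorized call; a sketch on illustrative prices:

```r
sale_price <- c(100, 250, 180, 90)   # illustrative values
mean_price <- mean(sale_price)        # 155

# ifelse evaluates the comparison element-wise over the whole vector
price <- factor(ifelse(sale_price < mean_price, "low", "high"))
price
# [1] low  high high low
# Levels: high low
```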

price<-as.factor(price)
price
##    [1] high high high low  high low  high high low  low  low  high low  high
##   [15] low  low  low  low  high low  high low  high low  low  high low  high
##   [29] high low  high low  high high high high low  low  low  low  high high
##   [43] low  low  low  high high high low  low  high low  low  high low  high
##   [57] high high high low  high low  high low  high high high high low  high
##   [71] high low  high low  low  low  low  low  low  low  high low  high low 
##   [85] high high high high low  low  low  low  high low  high high high low 
##   [99] low  low  high high low  high high high low  low  low  high low  high
##  [113] high high high high low  low  high high high low  low  low  high low 
##  [127] low  low  low  low  high high low  high high high low  high high high
##  [141] low  high high high low  low  low  high low  low  low  high high high
##  [155] low  low  low  high high high high high high low  low  low  high high
##  [169] high high low  high high high high high high high high low  high high
##  [183] low  high low  high high low  low  high high high high low  low  low 
##  [197] high high low  high low  high low  low  low  high low  low  high low 
##  [211] low  high high low  high low  high low  high high high high high low 
##  [225] high low  high low  low  high low  high low  low  high low  high high
##  [239] high low  high low  low  low  high high low  low  high high low  high
##  [253] high high low  high high high high low  high high low  low  low  high
##  [267] high high low  low  high high high low  low  high high low  high high
##  [281] high high high high high high high low  low  low  high low  low  high
##  [295] high low  low  high high high low  high high low  high high high low 
##  [309] low  high high low  low  high high high high high high high high high
##  [323] high low  high low  high low  high low  low  low  high high high high
##  [337] high high high low  high low  low  high low  low  low  low  low  high
##  [351] high high low  low  low  high high low  low  high low  low  high low 
##  [365] high low  high high low  high high low  low  low  high low  low  high
##  [379] high high low  high high low  high high low  low  high high low  high
##  [393] low  low  low  low  low  high low  high high high low  high high low 
##  [407] low  high high high low  low  high low  high high low  high low  low 
##  [421] high high low  high low  low  high low  high high low  low  low  high
##  [435] low  high low  low  low  low  high low  high high high low  high high
##  [449] low  low  low  high high high high high low  high high low  high low 
##  [463] low  high low  high high low  high high high high low  high high low 
##  [477] high high high low  high high low  high low  low  low  high high low 
##  [491] low  low  high low  low  high high high low  low  low  high low  high
##  [505] low  low  high high high low  high high low  low  low  high high high
##  [519] high high low  low  high high high high low  high low  high high low 
##  [533] low  high high low  high low  high high high high high low  high high
##  [547] high low  low  high low  low  high low  high low  low  low  high high
##  [561] low  high low  high high low  high high high low  low  low  high high
##  [575] low  low  low  high low  low  high high low  high low  high low  low 
##  [589] low  low  high high low  low  low  high low  high high low  high low 
##  [603] high low  high high low  high high low  high low  high low  low  low 
##  [617] high low  high high low  high low  high high high low  low  low  high
##  [631] low  high low  low  low  high low  low  low  high high high high low 
##  [645] high low  low  low  low  low  high low  high low  high low  low  low 
##  [659] low  high high high low  low  high high low  high high low  high low 
##  [673] high high low  low  low  low  high low  low  high high high high high
##  [687] high low  high high low  high high low  low  high low  low  low  high
##  [701] high low  high low  high low  high high high low  high low  high low 
##  [715] low  high high low  high low  high low  low  low  high low  high high
##  [729] low  low  high high high low  low  high low  high high high low  low 
##  [743] high high high high high high high low  low  high high high low  high
##  [757] high high high high low  low  high high high high high high high high
##  [771] low  low  low  low  high high high low  low  low  high high high high
##  [785] low  high low  high low  high high low  high high high high low  low 
##  [799] high high high low  high high low  high low  high high low  high low 
##  [813] low  low  low  high low  high low  high high low  high low  high high
##  [827] low  high high low  high low  high high low  low  low  low  low  low 
##  [841] low  low  high low  low  high high low  high high low  high high high
##  [855] high low  low  high low  high high low  low  low  high low  high low 
##  [869] high high low  high low  low  low  high low  high low  low  low  high
##  [883] high low  low  high low  low  high low  low  high low  high low  low 
##  [897] low  low  high low  low  low  high high low  low  high high low  high
##  [911] low  low  low  low  high low  high low  high high high low  high high
##  [925] high high high high high high high low  high high high low  high high
##  [939] high high low  high low  low  low  low  low  high high high low  low 
##  [953] low  high low  low  low  low  high low  low  high low  high high high
##  [967] high low  high low  low  high low  high high high low  high low  low 
##  [981] high high high high low  low  low  high high high high high high high
##  [995] high low  low  high low  high low  low  high low  high low  high low 
## [1009] high low  low  low  high low  low  high high high high high high high
## [1023] low  high high low  high high low  low  high high high high low  low 
## [1037] high high low  low  low  high high high high low  high low  low  low 
## [1051] high high high low  high high high high high high high low  low  low 
## [1065] low  high high high low  low  low  low  low  high high high high low 
## [1079] low  low  low  low  high high high low  low  high low  high low  high
## [1093] low  low  low  high low  high low  low  low  low  low  high low  high
## [1107] high high high high high high low  low  low  high high low  low  low 
## [1121] low  high low  low  high low  high high high low  low  low  low  high
## [1135] high low  low  low  high low  low  high high low  low  low  high high
## [1149] low  low  low  low  high low  high high high high high high low  high
## [1163] low  low  high high high high high high high high high high high high
## [1177] low  low  low  low  high high high low  high low  low  high high high
## [1191] high high low  high high high high low  high low  low  high low  high
## [1205] low  high low  high low  high high high low  low  low  low  low  high
## [1219] low  low  low  low  low  low  high low  high low  high low  high low 
## [1233] low  low  low  low  high high low  high high high high high high high
## [1247] high high low  low  high high low  high high low  high low  high low 
## [1261] high low  high high high high low  high high low  high high low  high
## [1275] low  low  high high high low  high high low  low  high low  low  high
## [1289] high high high low  low  high low  low  low  low  high low  high high
## [1303] high high low  high high low  low  high high high high high low  high
## [1317] high high high low  low  low  high low  low  low  low  low  high high
## [1331] high low  low  low  low  high low  high high low  low  low  high high
## [1345] low  low  high high high low  high high low  high high high low  low 
## [1359] high high high high low  low  low  high high low  low  high low  high
## [1373] high high high high low  low  low  high low  high low  low  low  low 
## [1387] high low  high low  high low  low  high high high high low  low  low 
## [1401] low  high high high low  high low  low  low  high high low  low  high
## [1415] high high low  high low  high high low  low  high low  low  high low 
## [1429] low  high high low  low  high high high low  high low  high high low 
## [1443] high low  high low  low  high low  low  low  high low  low  high high
## [1457] high high low  low 
## Levels: high low

Firstly, we explore how well (or poorly) each of the independently measured variables separates the two price groups.

suppressMessages(library(ggpubr))

p1 <- ggplot(data = data, aes(x = GrLivArea, fill = price)) +
      geom_histogram(position = "identity", alpha = 0.5)
p2 <- ggplot(data = data, aes(x = GarageArea, fill = price)) +
      geom_histogram(position = "identity", alpha = 0.5)
p3 <- ggplot(data = data, aes(x = X1stFlrSF, fill = price)) +
      geom_histogram(position = "identity", alpha = 0.5)
p4 <- ggplot(data = data, aes(x = LotArea, fill = price)) +
      geom_histogram(position = "identity", alpha = 0.5)
p5 <- ggplot(data = data, aes(x = OverallQual, fill = price)) +
      geom_histogram(position = "identity", alpha = 0.5)
p6 <- ggplot(data = data, aes(x = YearBuilt, fill = price)) +
      geom_histogram(position = "identity", alpha = 0.5)

p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p4
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p5
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p6
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

No single variable seems to separate the price classes adequately on its own. Next, we explore which pairs of variables better differentiate between the price levels.

pairs(x = data[, c("GrLivArea","GarageArea", "X1stFlrSF", "LotArea", "OverallQual", "YearBuilt")],
      col = c("firebrick", "green3")[price], pch = 19)

Univariate and multivariate normality

Next, we graphically explore the normality of the individual distributions of our predictors using histograms and Q-Q plots.

Individual distributions

par(mfcol = c(2, 3))
for (k in c(1, 2, 3, 5, 6, 7)) {
  j0 <- names(data)[k]
  x0 <- seq(min(data[, k]), max(data[, k]), length.out = 50)
  for (i in 1:2) {
    i0 <- levels(price)[i]
    x  <- data[price == i0, j0]
    hist(x, probability = TRUE, col = grey(0.8), main = paste("Price", i0), xlab = j0)
    lines(x0, dnorm(x0, mean(x), sd(x)), col = "red", lwd = 2)
  }
}

Q-Q plots

par(mfrow = c(2, 3))
for (k in c(1, 2, 3, 5, 6, 7)) {
  j0 <- names(data)[k]
  for (i in 1:2) {
    i0 <- levels(price)[i]
    x  <- data[price == i0, j0]
    qqnorm(x, main = paste("price", i0, j0), pch = 19, col = i + 1)
    qqline(x)
  }
}

This graphical exploration gives a first impression of whether each variable is approximately normal within each group, but it is always better to confirm with formal normality tests.

Normality test (Shapiro-Wilk)

library(reshape2)
shap <- function(x) { shapiro.test(x)$p.value }

datos1 <- data
datos1$price <- price

datos_tidy <- melt(datos1, value.name = "value")
## Using price as id variables
aggregate(value ~ price + variable, data = datos_tidy,
          FUN = shap)
##    price    variable        value
## 1   high   GrLivArea 1.733750e-10
## 2    low   GrLivArea 1.327903e-16
## 3   high  GarageArea 3.331606e-13
## 4    low  GarageArea 3.184144e-12
## 5   high   X1stFlrSF 1.213088e-06
## 6    low   X1stFlrSF 1.327458e-12
## 7   high     LotArea 1.488615e-22
## 8    low     LotArea 4.731465e-21
## 9   high OverallQual 1.385541e-18
## 10   low OverallQual 3.517968e-22
## 11  high   YearBuilt 7.327021e-30
## 12   low   YearBuilt 5.072653e-09
## 13  high   SalePrice 1.363801e-24
## 14   low   SalePrice 6.489266e-15

As observed in the results above, every p-value obtained is far below any usual significance level (e.g., 0.05). Consequently, we cannot assume normality for any variable in either price group.
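Since the rejected variables include strongly right-skewed quantities (living areas, lot sizes), a log transformation is a common remedy worth checking before giving up on normality. The following is only a diagnostic sketch (not run above, and the variable subset is chosen for illustration); whether the transform actually helps depends on the data.

```r
# Sketch: re-run the Shapiro-Wilk test on the log scale for two of the
# skewed area variables, within each price group.
for (v in c("GrLivArea", "LotArea")) {
  for (g in levels(price)) {
    p <- shapiro.test(log(data[price == g, v]))$p.value
    cat(v, g, "log-scale Shapiro-Wilk p-value:", signif(p, 3), "\n")
  }
}
```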

Multivariate normality

The MVN package in R provides functions for conducting three commonly used tests to assess multivariate normality. Additionally, it includes functions for analyzing outliers, as multivariate normality can be influenced by the presence of outliers.

suppressMessages(library(MVN))
## Warning: package 'MVN' was built under R version 4.3.2
royston_test <- mvn(data = data, mvnTest = "royston", multivariatePlot = "qq")

royston_test$multivariateNormality
##      Test        H       p value MVN
## 1 Royston 565.0778 1.532065e-117  NO
hz_test <- mvn(data = data, mvnTest = "hz")
hz_test$multivariateNormality
##            Test       HZ p value MVN
## 1 Henze-Zirkler 5.632571       0  NO

Both tests reject the null hypothesis, so we cannot assume multivariate normality either.

Discriminant function

The normality assumption does not hold, but we will proceed with the analysis anyway, bearing in mind that the resulting classifier and its error estimates may be less accurate or reliable as a consequence:

suppressMessages(library(MASS))
modelo_lda <- lda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + YearBuilt, data = data)
modelo_lda
## Call:
## lda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + 
##     YearBuilt, data = data)
## 
## Prior probabilities of groups:
##      high       low 
## 0.5356164 0.4643836 
## 
## Group means:
##      GrLivArea GarageArea X1stFlrSF  LotArea OverallQual YearBuilt
## high  1684.239   553.1393  1268.873 9392.579    6.942591  1986.734
## low   1200.428   360.8582   986.655 9392.747    5.141750  1954.392
## 
## Coefficients of linear discriminants:
##                       LD1
## GrLivArea   -0.0014546451
## GarageArea  -0.0003755866
## X1stFlrSF   -0.0004928186
## LotArea      0.0061661599
## OverallQual -0.4184650698
## YearBuilt   -0.0177120055

The output of this object displays the prior probabilities of each group (0.54 for high prices and 0.46 for low), the means of each predictor by group, and the coefficients of the linear discriminant function.

Once the classifier is built, new data can be classified based on its measurements simply by calling the predict function.

library(biotools)
## Warning: package 'biotools' was built under R version 4.3.2
## ---
## biotools version 4.2
pred <- predict(modelo_lda, dimen = 1)
confusionmatrix(price, pred$class)
##      new high new low
## high      678     104
## low        71     607
training_error <- mean(price != pred$class) * 100
paste("training_error=", training_error, "%")
## [1] "training_error= 11.986301369863 %"

This corresponds to a training accuracy of about 88.0%.
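To illustrate classifying genuinely new data, the sketch below predicts the price group of a single hypothetical house; the predictor values are made up for illustration and do not come from the data set.

```r
# Sketch: classify a hypothetical new house (illustrative values only).
new_house <- data.frame(GrLivArea = 1500, GarageArea = 450, X1stFlrSF = 1100,
                        LotArea = 9000, OverallQual = 6, YearBuilt = 1990)
pred_new <- predict(modelo_lda, newdata = new_house)
pred_new$class      # predicted price group
pred_new$posterior  # posterior probability of each group
```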

Visualization of predictions

The partimat function from the klaR package allows for the visualization of classification boundaries of a linear or quadratic discriminant model for each pair of predictors. Each color represents a classification region according to the model, and the centroid of each region, along with the actual values of the observations, is displayed.

suppressMessages(library(klaR))
## Warning: package 'klaR' was built under R version 4.3.2
partimat(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + YearBuilt,
         data = data, method = "lda", prec = 200,
         image.colors = c("darkgoldenrod1", "snow2"),
         col.mean = "firebrick",nplots.vert =1, nplots.hor=3)

Quadratic discriminant

As with Linear Discriminant Analysis, Quadratic Discriminant Analysis begins with a graphical exploration of the data and checks of univariate and multivariate normality and homogeneity of variances, all of which were performed above.

Discriminant function

suppressMessages(library(MASS))
qda_model <- qda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + YearBuilt, data = data)
qda_model
## Call:
## qda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + 
##     YearBuilt, data = data)
## 
## Prior probabilities of groups:
##      high       low 
## 0.5356164 0.4643836 
## 
## Group means:
##      GrLivArea GarageArea X1stFlrSF  LotArea OverallQual YearBuilt
## high  1684.239   553.1393  1268.873 9392.579    6.942591  1986.734
## low   1200.428   360.8582   986.655 9392.747    5.141750  1954.392

The output of this object displays the prior probabilities of each group, in this case, 0.54 and 0.46, and the means of each predictor by group.

Once the classifier is built, new data can be classified based on its measurements simply by calling the predict function.

Training error

pred <- predict(qda_model, dimen = 1)
confusionmatrix(price, pred$class)
##      new high new low
## high      681     101
## low        80     598
training_error <- mean(price != pred$class) * 100
paste("training_error=", training_error, "%")
## [1] "training_error= 12.3972602739726 %"

The training accuracy is around 87.6%, slightly below that of the linear model.
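The error computed above is the apparent (training) error, which tends to be optimistic. MASS supports leave-one-out cross-validation directly via the `CV = TRUE` argument of `qda`, which returns the cross-validated class assignments instead of a fitted model object; a sketch:

```r
# Sketch: leave-one-out cross-validation with MASS::qda.
qda_cv <- qda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea +
                OverallQual + YearBuilt, data = data, CV = TRUE)
cv_error <- mean(price != qda_cv$class) * 100
paste("cv_error=", cv_error, "%")
```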

Visualization of predictions

As with the linear model, the partimat function from the klaR package visualizes the quadratic classification boundaries for each pair of predictors.

suppressMessages(library(klaR))
partimat(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + YearBuilt,
         data = data, method = "qda", prec = 200,
         image.colors = c("darkgoldenrod1", "snow2"),
         col.mean = "firebrick",nplots.vert =1, nplots.hor=3)

Cluster analysis

Hierarchical clustering

Hierarchical clustering seeks a hierarchy of groups based on the closeness or similarity of the data under the chosen distance. In the agglomerative case, we start by merging the closest observations; the next closest pairs are then merged, building groups in an ascending manner. This construction can be visualized with a dendrogram.

In the dendrogram below, groups are defined by cutting the tree at a given height: the number of vertical lines crossed at that height gives the number of groups, so the plot itself can guide the choice of the optimal number of clusters.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%()    masks ggplot2::%+%()
## ✖ psych::alpha()  masks ggplot2::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ✖ tibble::view()  masks summarytools::view()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Package required to call 'clusGap' function
library(cluster)

# Package required to call 'get_dist', 'fviz_cluster' and 'fviz_dist' functions
library(factoextra)

# Package required to call 'ggdendrogram' function
library(ggdendro)
## Warning: package 'ggdendro' was built under R version 4.3.2
## 
## Attaching package: 'ggdendro'
## 
## The following object is masked from 'package:summarytools':
## 
##     label
# Package required to call 'grid.arrange' function
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
dendrogram <- hclust(dist(data, method = 'euclidean'), method = 'ward.D')
ggdendrogram(dendrogram, rotate = FALSE, labels = FALSE, theme_dendro = TRUE) + 
  labs(title = "Dendrogram")
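The dendrogram only displays the hierarchy; to obtain concrete group assignments we can cut the tree with `cutree`. The choice of k = 2 below is purely illustrative:

```r
# Sketch: cut the dendrogram into explicit group labels (k = 2 here only
# for illustration; the cut height determines the number of groups).
groups <- cutree(dendrogram, k = 2)
table(groups)  # size of each hierarchical cluster
```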

K-means

R implements the K-means algorithm in the function of the same name, kmeans. It receives as input the data and the number of clusters to form (the centers parameter). To mitigate the algorithm's sensitivity to the initial seed points, the nstart parameter runs it from multiple random initial configurations and keeps the best solution; for example, nstart = 25 tries 25 initial configurations. Using this parameter is recommended.

k3 <- kmeans(data, centers = 3, nstart = 25)

str(k3)
## List of 9
##  $ cluster     : int [1:1460] 2 3 2 3 2 3 3 2 1 1 ...
##  $ centers     : num [1:3, 1:7] 1155 1733 1550 325 562 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:7] "GrLivArea" "GarageArea" "X1stFlrSF" "LotArea" ...
##  $ totss       : num 2.38e+12
##  $ withinss    : num [1:3] 1.73e+11 8.74e+10 1.34e+11
##  $ tot.withinss: num 3.94e+11
##  $ betweenss   : num 1.98e+12
##  $ size        : int [1:3] 461 272 727
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
fviz_cluster(k3,data=data)
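Note that k-means with Euclidean distance is sensitive to the scale of the variables: here SalePrice and LotArea are several orders of magnitude larger than OverallQual, so they dominate the distances. A common refinement, not applied to the results in this report, is to standardize the variables first; a sketch:

```r
# Sketch (not used for the results below): standardize the variables so no
# single large-scale variable (e.g. SalePrice, LotArea) dominates the
# Euclidean distances used by k-means.
data_scaled <- scale(data)
k3_scaled <- kmeans(data_scaled, centers = 3, nstart = 25)
fviz_cluster(k3_scaled, data = data_scaled)
```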

set.seed(123)
k2 <- kmeans(data, centers = 2, nstart = 25)
k3 <- kmeans(data, centers = 3, nstart = 25)
k4 <- kmeans(data, centers = 4, nstart = 25)
k5 <- kmeans(data, centers = 5, nstart = 25)

# Plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = data) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = data) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = data) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = data) + ggtitle("k = 5")


grid.arrange(p1, p2, p3, p4, nrow = 2)

Optimal number of clusters

set.seed(123)
fviz_nbclust(data, kmeans, method = "wss")

The "elbow" of the curve suggests that 4 would be the optimal number of clusters.

Silhouette method

set.seed(123)
fviz_nbclust(data, kmeans, method = "silhouette")

Gap statistic

set.seed(123)
gap_stat <- clusGap(data, FUN = kmeans, nstart = 25,K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

Each method suggests a different number of clusters, but the suggestions tend to be around 2, so we will take 2 as the optimal number of clusters.

Analysis of the result

set.seed(123)
final <- kmeans(data, 2, nstart = 25)
print(final)
## K-means clustering with 2 clusters of sizes 817, 643
## 
## Cluster means:
##   GrLivArea GarageArea X1stFlrSF  LotArea OverallQual YearBuilt SalePrice
## 1  1669.197   548.1040 1260.6442 9392.569    6.893643  1985.866  186152.9
## 2  1193.205   356.7898  981.7482 9392.769    5.105920  1953.735  122203.6
## 
## Clustering vector:
##    [1] 1 1 1 2 1 2 1 1 2 2 2 1 2 1 1 2 2 2 1 2 1 2 1 2 2 1 2 1 1 2 1 2 1 1 1 1 2
##   [38] 2 2 2 1 1 2 2 2 1 1 1 2 2 1 2 2 1 2 1 1 1 1 2 1 2 1 2 1 1 1 1 2 1 1 2 1 2
##   [75] 2 2 2 2 2 2 1 2 1 2 1 1 1 1 2 2 2 2 1 2 1 1 1 2 2 2 1 1 2 1 1 1 2 2 2 1 2
##  [112] 1 1 1 1 1 2 1 1 1 1 2 2 2 1 2 2 2 1 2 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 2 2 1
##  [149] 2 2 2 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 2 1 2
##  [186] 1 1 2 2 1 1 1 1 2 2 2 1 1 2 1 2 1 2 2 2 1 2 2 1 2 2 1 1 1 1 2 1 2 1 1 1 1
##  [223] 1 2 1 2 1 2 2 1 2 1 2 2 1 2 1 1 1 2 1 2 2 2 1 1 2 2 1 1 2 1 1 1 2 1 1 1 1
##  [260] 2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 1 1 2
##  [297] 2 1 1 1 1 1 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 2 2 1
##  [334] 1 1 1 1 1 1 1 1 2 2 1 2 2 2 1 2 1 1 1 2 2 2 1 1 2 2 1 1 2 1 2 1 2 1 1 2 1
##  [371] 1 2 2 2 1 2 2 1 1 1 2 1 1 2 1 1 2 2 1 1 2 1 2 2 2 2 2 1 2 1 1 1 2 1 1 2 2
##  [408] 1 1 1 2 2 1 2 1 1 2 1 2 2 1 1 2 1 2 2 1 2 1 1 2 2 2 1 2 1 2 2 2 2 1 2 1 1
##  [445] 1 2 1 1 2 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 1 2 1
##  [482] 1 1 1 2 2 1 1 1 2 2 2 1 1 2 1 1 1 2 2 2 1 2 1 2 2 1 1 1 2 1 1 2 2 2 1 1 1
##  [519] 1 1 2 2 1 1 1 1 2 1 2 1 1 2 2 1 1 2 1 2 1 1 1 1 1 2 1 1 1 2 2 1 2 2 1 2 1
##  [556] 2 2 2 1 1 2 1 2 1 1 2 1 1 1 2 2 2 1 1 2 2 2 1 2 2 1 1 2 1 2 1 2 2 2 2 1 1
##  [593] 2 2 2 1 2 1 1 2 1 2 1 2 1 1 2 1 1 2 1 2 1 2 2 2 1 2 1 1 2 1 2 1 1 1 2 2 2
##  [630] 1 2 1 2 2 2 1 2 2 2 1 1 1 1 2 1 2 2 1 1 2 1 2 1 2 1 2 2 2 2 1 1 1 2 2 1 1
##  [667] 2 1 1 2 1 2 1 1 2 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 2 1 1 2 2 1 2 2 2 1 1 2 1
##  [704] 2 1 2 1 1 1 2 1 2 1 2 2 1 1 1 1 2 1 2 2 2 1 2 1 1 2 2 1 1 1 2 2 1 2 1 1 1
##  [741] 2 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1
##  [778] 2 2 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 2 1 1 2 1 2 2 1
##  [815] 2 1 2 1 1 1 1 2 1 2 1 1 2 1 1 2 1 2 1 1 2 2 2 2 2 2 2 1 1 2 2 1 1 2 1 1 2
##  [852] 1 1 1 1 2 2 1 2 1 1 2 2 2 1 2 1 2 1 1 2 1 2 2 2 1 2 1 2 2 1 1 1 2 2 1 2 2
##  [889] 1 2 2 1 1 1 2 2 2 2 1 2 2 2 1 1 2 2 1 1 2 1 1 2 2 2 1 2 1 2 1 1 1 2 1 1 1
##  [926] 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 1 2 1
##  [963] 1 1 1 1 1 2 1 2 2 1 2 1 1 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 2 1 2
## [1000] 1 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 2 1 1 1 1 2 2
## [1037] 1 1 2 2 1 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 1 1 1 2 2 2 2 1 1 1 2 2 2 2 2
## [1074] 1 1 1 1 2 1 2 2 2 1 1 1 2 2 1 2 1 2 1 2 2 2 1 2 1 2 1 2 2 2 1 2 1 1 1 1 1
## [1111] 1 1 2 2 2 1 1 2 2 2 2 1 2 2 1 2 1 1 1 2 2 2 2 1 1 2 2 2 1 2 2 1 1 2 2 2 1
## [1148] 1 2 2 2 2 1 2 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 2
## [1185] 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 2 2 1 2 1 2 1 2 1 2 1 1 1 2 2 2 2 2 1 2 2 2
## [1222] 2 2 2 1 2 1 2 1 2 1 2 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1 2
## [1259] 1 2 1 2 1 1 1 1 2 1 1 2 1 1 2 1 2 2 1 1 1 2 1 1 2 2 1 2 2 1 1 1 1 2 2 1 2
## [1296] 2 1 2 1 2 1 1 1 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 2 1 2 2 2 2 2 1 1 1 2
## [1333] 2 2 2 1 2 1 1 2 2 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 2 1 1 1 1 2 1 2 1 1 2 2
## [1370] 1 2 1 1 1 1 1 2 2 2 1 2 1 1 2 2 2 1 2 1 2 1 2 2 1 1 1 1 2 2 2 2 1 1 1 2 1
## [1407] 2 2 2 1 1 2 2 1 1 1 2 1 2 1 1 2 2 1 2 2 1 2 2 1 1 2 2 1 1 1 2 1 2 1 1 2 1
## [1444] 2 1 2 1 1 2 2 2 1 2 2 1 1 1 1 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 595618318126 311797081882
##  (between_SS / total_SS =  61.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
fviz_cluster(final, data = data)

data %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
## # A tibble: 2 × 8
##   Cluster GrLivArea GarageArea X1stFlrSF LotArea OverallQual YearBuilt SalePrice
##     <int>     <dbl>      <dbl>     <dbl>   <dbl>       <dbl>     <dbl>     <dbl>
## 1       1     1669.       548.     1261.   9393.        6.89     1986.   186153.
## 2       2     1193.       357.      982.   9393.        5.11     1954.   122204.

As seen in the graph above, the data were separated into 2 clusters. The clusters are not perfectly separated from each other, but each is internally fairly homogeneous, as the table above shows: one cluster groups houses with higher mean values, while the other groups houses with lower values. The mean LotArea is almost identical in the two clusters, which helps explain why they overlap in the projection.
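Since the clusters and the high/low price label were obtained independently, cross-tabulating them shows how closely the unsupervised grouping tracks the price classes; a sketch:

```r
# Sketch: compare the unsupervised k-means clusters with the high/low
# price label via a contingency table.
table(Cluster = final$cluster, Price = price)
```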